Analyse data for Hilton International Hotels to provide strategy recommendations that will lead to revenue growth.
The solutions to the following questions will help us answer our research question:
You are a Data Scientist working for Hilton International Hotels and need to help the management decide on the strategies that will lead to revenue growth.
You have a dataset containing information on the various chain of hotels, including when customers made the booking, length of stay, the number of adults, children, or babies, and the number of available parking spaces, among other things.
We will first perform data exploration, data cleaning, and analysis to come with appropriate recommendations.
The data provided was relevant to the research question since it answers the research questions
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# install.packages("reshape")
library(reshape)
##
## Attaching package: 'reshape'
##
## The following object is masked from 'package:dplyr':
##
## rename
##
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
# install.packages("knitr")
# install.packages("rmarkdown")
library(knitr)
library(rmarkdown)
# Load the data below
# ---
# Dataset url = https://bit.ly/2WvQbhJ
# ---
# YOUR CODE GOES BELOW
#
hotel_booking_df <- read.csv("hotel_bookings.csv", header = TRUE, na.strings=c("NULL",NA))
# Checking the first 5 rows of data
# ---
# YOUR CODE GOES BELOW
#
head(hotel_booking_df, 5)
# Checking the last 5 rows of data
# ---
# YOUR CODE GOES BELOW
#
tail(hotel_booking_df, 5)
# Sample 10 rows of data
# ---
# YOUR CODE GOES BELOW
#
sample_n(hotel_booking_df, 10)
# Checking number of rows and columns
# ---
# YOUR CODE GOES BELOW
#
dim(hotel_booking_df)
## [1] 119390 32
# Checking datatypes
# ---
# YOUR CODE GOES BELOW
#
glimpse(hotel_booking_df)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent <int> NA, NA, NA, 304, 240, 240, NA, 303, 240…
## $ company <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type <chr> "Transient", "Transient", "Transient", …
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date <chr> "2015-07-01", "2015-07-01", "2015-07-02…
Record your general observations below:
Observation 1: There are about 120 thousand records in the data set with 32 fields for every record. Each field seem to have the correct data type Observation 2: Some of the Values in the dataset are either null or missing
The data is originally from the article Hotel Booking Demand Datasets, by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.
# Checking datatypes and missing entries of all the variables
# ---
# YOUR CODE GOES BELOW
#
str(hotel_booking_df)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : int NA NA NA 304 240 240 NA 303 240 15 ...
## $ company : int NA NA NA NA NA NA NA NA NA NA ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
sum(is.na(hotel_booking_df))
## [1] 129425
We observe the following from our dataset:
Observation 1: Correct datatypes are assigned for the different fields Observation 2: There are 129425 missing values in our data set
# Checking how many duplicate rows are there in the data
# ---
# YOUR CODE GOES BELOW
#
# hotel_booking_df[duplicated(hotel_booking_df), ]
sum(duplicated(hotel_booking_df))
## [1] 31994
We choose to keep the duplicates because we don’t have a unique identifier to actually proof that we have duplicates.
# Checking if any of the columns are all null
# ---
# YOUR CODE GOES BELOW
#
which(colSums(is.na(hotel_booking_df)) == ncol(hotel_booking_df))
## named integer(0)
We observe the following from our dataset:
Observation 1: There are no columns in which all the data is null
# Checking if any of the rows are all null
# ---
# YOUR CODE GOES BELOW
which(rowSums(is.na(hotel_booking_df)) == nrow(hotel_booking_df))
## integer(0)
We observe the following from our dataset:
Observation 1: there is no row in which all the data is null
# Checking the correlation of the features through the use of
# visualizations the correlation using heatmap
# ---
# YOUR CODE GOES BELOW
#
head(hotel_booking_df)
# Select the desired columns
hotel_booking_df2 <- hotel_booking_df %>%
na.omit() %>%
select(is_canceled, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, is_repeated_guest, previous_cancellations, previous_bookings_not_canceled, days_in_waiting_list, total_of_special_requests)
hotel_booking_df2
# Create a correlation matrix
corr_matrix <- cor(hotel_booking_df2, method="s")
## Warning in cor(hotel_booking_df2, method = "s"): the standard deviation is zero
head(corr_matrix)
## is_canceled stays_in_weekend_nights
## is_canceled 1.00000000 -0.1438928
## stays_in_weekend_nights -0.14389278 1.0000000
## stays_in_week_nights -0.08137791 0.7794933
## adults -0.04922699 -0.0664910
## children 0.05446082 -0.1683479
## babies NA NA
## stays_in_week_nights adults children babies
## is_canceled -0.08137791 -0.04922699 0.05446082 NA
## stays_in_weekend_nights 0.77949325 -0.06649100 -0.16834790 NA
## stays_in_week_nights 1.00000000 -0.15183455 -0.12611588 NA
## adults -0.15183455 1.00000000 0.20245466 NA
## children -0.12611588 0.20245466 1.00000000 NA
## babies NA NA NA 1
## is_repeated_guest previous_cancellations
## is_canceled -0.01183921 -0.01983730
## stays_in_weekend_nights -0.13333459 -0.02758733
## stays_in_week_nights -0.26416244 -0.08427287
## adults -0.07395879 -0.05488996
## children 0.17358125 -0.01147280
## babies NA NA
## previous_bookings_not_canceled days_in_waiting_list
## is_canceled -0.005786814 NA
## stays_in_weekend_nights -0.129795621 NA
## stays_in_week_nights -0.233541874 NA
## adults -0.133680418 NA
## children 0.190626508 NA
## babies NA NA
## total_of_special_requests
## is_canceled -0.11646585
## stays_in_weekend_nights -0.12033640
## stays_in_week_nights -0.24581095
## adults 0.01722921
## children 0.09985491
## babies NA
# Create a table with long form
corr_df <- melt(corr_matrix)
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
corr_df
# Plot the heatmap
ggplot(corr_df, aes(X1, X2, fill = value)) +
geom_tile(color = "black") +
geom_text(
aes(label = round(value, 2)),
color = "pink"
) +
coord_fixed() +
labs(
fill="Pearson Correlation"
) +
scale_fill_gradient2(
low = "blue",
high = "red",
mid = "white",
limit = c(-1,1)
) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank()
)
## Warning: Removed 38 rows containing missing values (geom_text).
We observe the following from our dataset:
There is weak correlation (both positive and negative) between the various fields with the exception of “previous_bookings_not_canceled” and “is_repeated_guest” where we have a strong positive correlation
# Dropping company column because it has alot of missing values
# and we won't need to answer any of our questions
# ---
# YOUR CODE GOES BELOW
#
hotel_booking_df = select(hotel_booking_df, -company)
head(hotel_booking_df)
From the data variable description we see that the Distribution Channel category that tells us about Booking distribution.
The term “TA” means “Travel Agents” The term “TO” means “Tour Operators” This allows us to fill the missing values in the agents column with TO
# We replace the mising values i.e. for TO
# ---
# YOUR GOES BELOW
#
hotel_booking_df$agent[is.na(hotel_booking_df$agent)] <- "TO"
head(hotel_booking_df)
# We drop rows where there is no adult, baby and child as
# these records won't help us.
# ---
# YOUR GOES BELOW
#
hotel_booking_df <- subset(hotel_booking_df, adults!=0 | children!=0 | babies!=0)
head(hotel_booking_df)
# We replace missing children values with rounded mean value
# ---
# Hint i.e. use round()
# ---
# YOUR GOES BELOW
#
hotel_booking_df$children[is.na(hotel_booking_df$children)] <- round(mean(hotel_booking_df$children, na.rm = TRUE), 0)
head(hotel_booking_df)
# Checking for missing values in the dataframe
# ---
# YOUR GOES BELOW
#
sum(is.na(hotel_booking_df))
## [1] 478
# Converting the datatypes of the following columns from float to integer
# i.e. children, company, agent
# ---
# YOUR GOES BELOW
#
# hotel_booking_df$children <- lapply(hotel_booking_df$children, as.numeric)
# as.integer(hotel_booking_df$children)
head(hotel_booking_df)
# 1. How many bookings were cancelled?
# ---
# Visualisation: Barplot
hotel_booking_df %>%
group_by(reservation_status) %>%
summarise(is_canceled = sum(is_canceled))
ggplot(hotel_booking_df, aes(x= reservation_status, y= is_canceled))+
geom_col(stat = "identity", width = 0.5, fill= "blue") +
labs(
title = "Reservation Status",
x= "Status",
y= "Number"
) +
theme(
plot.title = element_text(hjust = 0.5, size=16, family="Ariel Black", color="purple")
)
## Warning: Ignoring unknown parameters: stat
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
# 2. What was the booking ratio between resort hotel and city hotel?
# ---
# Barplot of booking ratio between resort hotel and city hotel
resort_city_hotel <- hotel_booking_df %>%
count(hotel)
resort_city_hotel
ggplot(data=resort_city_hotel, aes(x="", y=n, fill=hotel)) +
geom_bar(
stat="identity", width=0.5, color = "white"
) +
coord_polar("y") +
geom_text(
aes(label = paste0(round(n / sum(n) * 100, 2), "%")), position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette="Set2")+
labs(
fill = "Hotel",
title = "Booking Between Resort Hotel and City Hotel"
)+
theme_void()
# 3. What was the percentage of booking for each year?
# ---
#
year_booking <- hotel_booking_df %>%
count(arrival_date_year)
year_booking
ggplot(data=year_booking, aes(x="", y=n, fill=factor(arrival_date_year))) +
geom_bar(
stat="identity", width=0.5, color = "cyan"
) +
coord_polar("y") +
geom_text(
aes(label = paste0(round(n / sum(n) * 100, 2), "%")), position = position_stack(vjust = 0.5)
) +
scale_fill_brewer(palette="Set1") +
labs(
fill = "Years",
title = "Booking for each year") +
theme_void()
# 4. Which were the most busiest months for hotels?
# ---
#
busy_months <- hotel_booking_df %>%
count(arrival_date_month)
busy_months
ggplot(busy_months, aes(x= arrival_date_month, y= n))+
geom_col(stat = "identity", width = 0.8, fill= "magenta") +
labs(
title = "Bookings per Month",
x= "Month",
y= "Number of bookings"
) +
theme(
plot.title = element_text(hjust = 0.5, size=16, family="Arial", color="blue")
)
## Warning: Ignoring unknown parameters: stat
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
# 5. From which top 3 countries did most guests come from?
# ---
# YOUR GOES BELOW
#
hotel_booking_df
countries <- hotel_booking_df %>%
group_by(country) %>%
summarise(guests = sum(adults)+sum(children)+sum(babies)) %>%
# arrange(desc(guests)) %>%
top_n(3)
## Selecting by guests
countries
ggplot(countries, aes(x= country, y= guests))+
geom_col(stat = "identity", width = 0.8, fill= "blue") +
labs(
title = "Top 3 Countries per Guests",
x= "Countries",
y= "Number of Guests"
) +
theme(
plot.title = element_text(hjust = 0.5, size=16, family="Arial", color="black")
)+
coord_flip()
## Warning: Ignoring unknown parameters: stat
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
# 6.a) How long do most people stay in hotels?
# b) By city and resort? Separate the data by hotel
# ---
#
stay_length <- hotel_booking_df %>%
group_by(hotel) %>%
summarise(total_stay_days = mean(stays_in_weekend_nights)+mean(stays_in_week_nights))
stay_length
ggplot(data = stay_length, aes(x= hotel, y= total_stay_days))+
geom_col(stat = "identity", width = 0.8, fill= "dark grey") +
labs(
title = "Days Stayed per Hotel",
x= "Hotel Type",
y= "Number of Days Stayed"
) +
theme(
plot.title = element_text(hjust = 0, size=16, family="Arial", color="blue")
)+
theme_classic()
## Warning: Ignoring unknown parameters: stat
# 7. Which was the most booked accommodation type (Single, Couple, Family)?
# ---
#
accommodation <- hotel_booking_df %>%
count(reserved_room_type)
accommodation
ggplot(data = accommodation, aes(x= reserved_room_type, y= n))+
geom_col(stat = "identity", width = 0.8, fill= "brown") +
labs(
title = "Bookings per Accomodation Type",
x= "Room Type",
y= "Number of Bookings"
) +
theme(
plot.title = element_text(hjust = 0.5, size=16, family="Arial", color="blue")
)+
coord_flip()
## Warning: Ignoring unknown parameters: stat
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
From the above analysis, below are our recommendations: A lot of cancellation do happen accross the hotels while only a few are a no show. Reasons for cancellation should be investigated so as to minimize on lost revenue
Twice as much guests prefer city hotels to resort hotels. Management should invest in city hotels to ensure they have sufficnent capacity
Booking are not consistent year on year and management should be prepared for flactuations in number of guests each year
August is the busisest month and management must be prepared to handle the high number of guests in busy months to tap in revenue Incentives can be provided during the less busy months to encourage bookings
Most guests come from Portugal, Great Britain and France
In this step, we review our solution and implement approaches that could potentially provide a better outcome. In our case, we could propose the following question that wasn’t answered in our solution because it couldn’t have greatly contributed to our recommendation.
# When should hotels provide special offers?
# ---
# YOUR GOES BELOW
#
not_busy_months <- hotel_booking_df %>%
count(arrival_date_month)%>%
top_n(-3)
## Selecting by n
not_busy_months
ggplot(not_busy_months, aes(x= arrival_date_month, y= n))+
geom_col(stat = "identity", width = 0.5, fill= "#3B7A57") +
labs(
title = "Least Bookings per Month",
x= "Month",
y= "Number of bookings"
) +
theme(
plot.title = element_text(hjust = 0.5, size=16, family="Arial", color="#4B3621")
)
## Warning: Ignoring unknown parameters: stat
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
Our observations: November, December, and January are the least busy months. This means most of the facilities in the hotel remain underutilized and management can seek to provide offers that incentivize guests to make bookings during these months
How does this observation tie to our solution? This observation helps management with an opportunity to further grow their revenue by attracting guests even during the low seasons
During this step, you rethink and propose other ways that you can improve your solution.
a). Did we have the right data? Yes, we had the right data which was from a credible source b). Do we need other data to answer our question? provided data was sufficient to answer our research questions, however, other data that could provide more insights such as reasons for the high number of cancellations would have been of great help c). Did we have the right question? Research questions formulated were adequate to provide sound recommendations